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Abstract 



This paper presents a deep neural-network-based hierarchical graphical model for in¬ 
dividual and group activity recognition in surveillance scenes. Deep networks are used 
to recognize the actions of individual people in a scene. Next, a neural-network-based 
hierarchical graphical model refines the predicted labels for each class by considering 
dependencies between the classes. This refinement step mimics a message-passing step 
similar to inference in a probabilistic graphical model. We show that this approach can be 
effective in group activity recognition, with the deep graphical model improving recog¬ 
nition rates over baseline methods. 


1 Introduction 

Event understanding in videos is a key element of computer vision systems in the context of 
visual surveillance, human-computer interaction, sports interpretation, and video search and 
retrieval. Therefore events, activities, and interactions must be represented in such a way 
that retains all of the important visual information in a compact and rich structure. Accu¬ 
rate detection and recognition of atomic actions of each individual person in a video is the 
primary component of such a system, and also the most important, as it affects the perfor¬ 
mance of the whole system signihcantly. Although there are many methods to determine 
human actions in uncontrolled environments, this task remains a challenging computer vi¬ 
sion problem, and robust solutions would open up many useful applications. The standard 
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and yet state-of-the-art pipeline for activity recognition and interaction description consists 
of extracting hand-crafted local feature descriptors either densely or at a sparse set of interest 
points (e.g., HOG, MBH, ...) in the context of a Bag of Words model [E3]. These are then 
used as the input either to a discriminative or a generative model. In recent years, it has 
been shown that deep learning techniques can achieve state-of-the-art results for a variety of 
computer vision tasks including action recognition [O, [H]. 

On the other hand, understanding of complex visual events in a scene requires exploita¬ 
tion of richer information rather than individual atomic activities, such as recognizing local 
pairwise and global relationships in a social context and interaction between individuals 
and/or objects [B, O, D, US, E3]. This complex scene description remains an open and 
challenging task. It shares all of difficulties of action recognition, interaction modeling', and 
social event description. Formulating this problem within the probabilistic graphical models 
framework provides a natural and powerful means to incorporate the hierarchical structure 
of group activities and interactions [O, O]. Given the fact that deep neural networks can 
achieve very competitive results on the single person activity recognition tasks, they can, 
produce better results when they are combined with other methods, e.g. graphical models, 
in order to capture the dependencies between the variables of interest [ED]. Following a sim¬ 
ilar idea of incorporating spatial dependency between variables into the deep neural network 
in a joint-training process presented [ED], here we focus on learning interactions and group 
activities in a surveillance scene by employing a graphical model in a deep neural network 
paradigm. 

In this paper, our main goal is to address the problem of group activity understanding 
and scene classification in complex surveillance videos using a deep learning framework. 
More specifically, we are focused on learning individual activities and describing the scene 
simultaneously while considering the pair-wise interactions between individuals and their 
global relationship in the scene. This is achieved by combining a Convolutional Network 
(ConvNet) with a probabilistic graphical model as additional layers in a deep neural network 
architecture into a unified learning framework. The probabilistic graphical models can be 
seen as a refining process for predicting class labels by considering dependencies between 
individual actions, body poses, and group activities. The probabilistic graphical model is 
modeled by a multi-step message passing neural network and the predicted label refinement 
is carried out through belief propagation layers in the neural network. Figure 1 depicts an 
overview of our approach for label refinement. Experimental results show the effectiveness 
of our algorithm in both activity recognition and scene classification. 

2 Previous Work 

The analysis of human activities is an active area of research. Decades of research on this 
topic have produced a diverse set of approaches and a rich collection of activity recognition 
algorithms. Readers can refer to recent surveys such as Poppe [DID] and Weinland et al. [E3] 
for a review. Many approaches concentrate on an activity performed by a single person, 
including state of the art deep learning approaches [O, D3]. 

In the context of scene classification and group activity understanding, many approaches 
use a hierarchical representation of activities and interactions for collective activity recogni¬ 
tion [O]. They have been focused on capturing spatio-temporal relationships between visual 

*The term “interaction” refers to any kind of interaction between humans, and humans and objects that are 
present in the scene, rather than activities which are performed by a single subject. 
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Figure 1; Recognizing individual and group activities in a deep network. Individual action 
labels are predicted via CNNs. Next, these are refined through a message passing neural 
network which considers the dependencies between the predicted labels. 



1^' step message 2'"' step message 

passing passing 


Figure 2; A schematic overview of our message passing CNN framework. Given an image 
frame and the detected bounding boxes around each person, our model predicts scores for 
individual actions and the group activities. The predicted labels are refined by applying a 
belief propagation-like neural network. This network considers the dependencies between 
individual actions and body poses, and the group activity. The model learns the message 
passing parameters and performs inference and learning in unified framework using back- 
propagation. 
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cues either by imposing a richer feature descriptor which accounts for context [□, mi] or a 
context-aware inference mechanism [□, □]. Hierarchical graphical models [□, O, 113, US], 
AND-OR graphs [□, 0], and dynamic Bayesian networks [E3] are among the representative 
approaches for group activity recognition. 

In traditional approaches, local-hand crafted features/descriptors has been employed to 
recognize atomic activities. Recently, it has been shown that the use of deep neural networks 
can by itself outperform other algorithms for atomic activity recognition. However, no prior 
art in the CNN-based video description used activities and scene information jointly in a 
unified graphical representation for scene classification. Therefore, the main objective of 
this research is to develop a system for activity recognition and scene classification which 
simultaneously uses the action and scene labels in a neural network-based graphical model 
to refine the predicted labels via a multiple-step message passing. 

More closely related to our approach is work combining graphical models with convo¬ 
lutional neural networks [0, ED]. In [ED], a one step message passing is implemented as 
a convolution operation in order to incorporate spatial relationship between local detection 
responses for human body pose estimation. In another study, Deng et al. [B] propose an 
interesting solution to improve label prediction in large scale classification by considering 
relations between the predicted class labels. They employ a probabilistic graphical model 
with hard constraints on the labels on top of a neural network in a joint training process. 
In essence, our proposed algorithm follows a similar idea of considering dependencies be¬ 
tween predicted labels for the actions, group activities, and the scene label to solve the group 
activity recognition problem. Here we focus on incorporating those dependencies by im¬ 
plementing the label refinement process via an inter-activity neural network, as shown in 
Figure 2. The network learns the message passing procedure and performs inference and 
learning in unified framework using back-propagation. 


3 Model 

Considering the architecture of our proposed structured label refinement algorithm for group 
activity understanding (see Figure 2), the key part of the algorithm is a multi-step message 
passing neural network. In this section, we describe how to combine neural networks and 
graphical models by mimicking a message passing algorithm and how to carry out the train¬ 
ing procedure. 

3.1 Graphical Models in a Neural Network 

Graphical models provide a natural way to hierarchically model group activities and capture 
the semantic dependencies between group and individual activities [O]. A graphical model 
defines a joint distribution over states of a set of nodes. For instance, one can use a factor 
graph, in which each (j), corresponds to a factor over a set of related variable nodes x, and y, , 
and models interactions between these nodes in a log-linear fashion: 

P{X,Y) oc Yl(l)i{xi,yi) oc exp(^Wkfk{x,y)) (1) 

i k 

where X are the inputs and Y the predicted labels, with weighted (wk) feature functions fk- 
When performing inference in a graphical model, belief propagation is often adopted as 
a way to infer states or probabilities of variables. In the belief propagation algorithm, each 
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Figure 3; Weight sharing scheme in neural network. We use a sparsely connected layer 
to represent message passing between variable nodes and factor nodes. Each factor node 
only connects to its relevant nodes. And factor nodes of same type share a template of 
parameters. For example, factor node 1 and 2 gathers information from a scene’s scenel, a 
person’s actioni and posel, and share one template of parameters. And factor node 3 adopts 
another set of weights. 


step of message passing first collects relevant information from connected nodes to a factor 
node, which represents the joint distribution (dependencies) over states, then passes these 
messages to variable nodes by marginalizing over states of irrelevant variables. 

Following this idea, we mimic the message passing process by representing each combi¬ 
nation of states as a neuron in neural network, denoted as a “factor neuron.” While normal 
message passing calculates dependencies rigidly, a factor neuron can be used to learn and 
predict dependencies between states and pass messages to variable nodes. In the setting of 
neural networks, this dependency representation becomes more flexible and can adopt varied 
types of neurons (linear, ReFU, Sigmoid, etc.). Moreover, by integrating graphical models 
into a neural network, the formulation of a graphical model allows for parameter sharing 
in the neural network, which not only reduces the number of free parameters to learn but 
also accounts for semantic similarities between factor neurons. Fig. 3 shows the parameter 
sharing scheme for different factor neurons. 


3.2 Message Passing CNN Architecture for Group Activity 

Representing group activities and individual activities as a hierarchical graphical model has 
proven successful [□, B, O]. We adopt a similar structured model that considers group activ¬ 
ity, individual activity, and group-individual interactions together. We introduce a new mes¬ 
sage passing Convolutional Neural Network framework as shown in Fig. 2. Our model has 
two main stages; (1) fine-tuned Convolutional Neural Networks that produce scene scores 
for a frame, and action and pose scores for each person in that frame; (2) Message Passing 
Neural Network phase capturing dependencies. 

Given an image I and a set of person detections {/i,/ 2 ,...,/m}, the first stage of our 
model outputs raw scores of scene, action and poses for image I and all detections /,■ in the 
image using fine-tuned CNNs. After a softmax normalization for each scene and person, 
these raw scores are taken as input of the graphical model part in the second stage. In the 
graphical model, outputs from CNNs correspond to unary potentials. Denote the scene- 
level, and per-person action and pose-level unary potentials for frame I as s**')(/), a**(/m), 
respectively. The superscript (0) is the index of message passing steps. We use G to 
denote all group activity labels, H to represent all the action labels and Z to denote all the 
pose labels. Then the group activity in one scene can be represented as gi, {/i/j ...,/i/j^}, 

{zii ,2/2 5 where g/ S G is the group activity label for image /, hj. and z/. are action 

labels and pose labels for a person /„,. 
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Note that for training, the scene, action, and pose CNN models in stage 1 are fine-tuned 
from an AlexNet architecture pretrained using ImageNet data. The architecture is similar to 
that proposed by [D] for object classification with some minor differences such as pooling 
is done before normalization. The network consists of five convolutional layers followed by 
two fully connected layers, and a softmax layer that outputs individual class scores. We use 
the softmax loss, stochastic gradient descent and dropout regularization to train these three 
ConvNets. 

In the second stage, we use the method mentioned in Sec. 3.1 to mimic message passing 
in a hierarchical graphical model for group activity in a scene. This stage can contain several 
steps of message passing. In each step, there are two types of passes: from outputs of step k — 
1 to factor layer and from factor layer to k step outputs. In the message passing step, the 
first pass computes dependencies between states. The inputs to the step message passing 
are where if 

is the scene score of image I for label g, af (/„,) is the action score of person /„, for label h 

and rf 4n)) is the pose score of person Im for label z. In the factor layer, the action, pose 
and scene interaction are calculated as: 




( 2 ) 


where is a 3-d parameter template for combination of scene g, action h and pose z. 
Similarly, pose interactions for all people in the scene are calculated as: 


Wj{4 


( 3 ) 


where r is all output nodes for all people, t is the factor neuron index for scene g. T latent 
factor neurons are used for a scene g. Note that parameters a and j3 are shared within factors 
that have the same semantic meaning. For the output of k^^' step message passing, the score 
for the scene label to be g can be defined as: 

+ E -f E (4) 

Jeef 

where ej and e| are the set of factor nodes that connected with scene g in first factor 
component(scene-action-pose factor) and second factor component (pose-global factor) re¬ 
spectively. Similarly, we also define action and pose scores after the k*^' message passing 
step as: 

mi 

Note that e = {ef,e 2 ,£i,£{,£ 2 } are connection configurations in the pass from factor neu¬ 
rons to output neurons. These connections are simply the reverse of the configurations in the 
first pass, from input to factors. The model parameters {W, a, j3} are weights on the edges 
of the neural network. Parameter W represents the concatenation of weights connected from 
factor layers to output layer (second pass), while a,)3 represent weights from the input layer 
of the k'^ message passing to factor layers (first pass). 
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3.2.1 Components in Factor Layers 

Now we explain in detail the different components in our model. 

Unary component: In our message passing model, the unary component corresponds 
to group activity scores for an image 7, action and pose scores for each person /„, in frame 
I, represented as sf *^(7), af ^\lm) and *^(7„,) respectively. These scores are acquired 
from the previous step of message passing and are directly added to the output of the next 
message passing step. 

Group activity-action-pose factor layer A group’s activity is strongly correlated 
to the participating individuals’ actions. This component for the model is used to measure 
the compatibility between individuals and groups. An individual’s activity can be described 
by both pose and action, and we use this ternary scene-pose-action factor layer to capture 
dependencies between a person’s fine-grained action (e.g. talking facing front-left) and the 
scene label for a group of people. Note that in this factor layer we used the weight sharing 
scheme mentioned in Sec. 3.1 to mimic the belief propagation. 

Poses-all factor layer y/: Pose information is very important in understanding a group 
activity. For example, when all people are looking in the same direction, there is a high 
probability that it’s a queueing scene. This component captures this global pose information 
for a scene. Instead of naively enumerate all combination of poses for all people, we exploit 
the sparsity of truly useful and frequent patterns, and simply use T factor nodes for one scene 
label. In our experiments, we simply set T to be 10. 


3.3 Multi Step Message Passing CNN Training 

The steps of message passing depends on the structure of graphical model. In general, graph¬ 
ical models with loops or large number of levels will lead to more steps belief propagation 
for sharing local information globally. In our model, we adopt two message passing steps, 
as shown in Fig. 2. 

Multi-loss training: Since the goal of our model is to recognize group activities through 
global features and individual actions in that group, we adopt an alternative strategy for 
training the model. For the message passing step, we first remove the loss layers for 
actions and poses to learn parameters for group activity classification alone. In this phase, 
there is no back-propagation on action and pose classification. Since group activity heavily 
depends on an individual’s activity, we then fix the softmax loss layer for scene classification 
and learn the model for actions and poses. The trained model is used for the next message 
passing step. Note that in each message passing step, we exploit the benefit of the neural 
network structure and jointly trained the whole network. 

Learning semantic features for group activity: Traditional convolutional neural net¬ 
works mainly focus on learning features for basic classification or localization tasks. How¬ 
ever, in our proposed message passing CNN deep model, we not only learn features, but also 
learn semantic high-level features for better representing group activities and interactions 
within the group. We explore different layers’ features for this deep model, and results show 
that these semantic features can be used for better scene understanding and classification. 

Implementation details: Firstly, in practice, it is not guaranteed that every frame has 
the same number of detections. However, the structure of neural network should be fixed. 
To solve this problem, denoting M,„ax as the maximum number of people contained in one 
frame, we do a dummy-image padding when the number of people is less than M^ax- Then 
we filter out these dummy data by de-activating neurons connected with them in related 
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layers. Secondly, After the first message passing step, instead of directly feeding the raw 
scores into the next message passing step, we first normalize the pose and action scores for 
each person and scene scores for one frame by a softmax layer, converting to probabilities 
similar to belief propagation. 


4 Experiments 

Our models are implemented using the Caffe library [DU] by defining two types of sparsely 
connected and weight shared inner product layers. One is from variable nodes to factor 
nodes, another is the reverse direction. We used TanH neurons as the non-linearity of these 
two layers. To examine the performance of our model, we test our model for scene classifi¬ 
cation on two datasets: (1) Collective Activity [0], (2) a nursing home dataset consisting of 
surveillance videos collected from a nursing home. 

We trained an RBF kernel SVM on features extracted from the graphical model layer 
after each step of message passing model. These SVMs are used to predict scene labels for 
each frame, the standard task in these datasets. 

4.1 Collective Activity Dataset 

The Collective Activity Dataset contains 44 video clips acquired using low resolution hand¬ 
held cameras. Every person is assigned one of the following five action labels: crossing, 
waiting, queuing, walking and talking and one of the eight pose labels: right, front-right, 
front, front-left, left, back-left, back, back-right. Each frame is assigned one of the following 
five activities: crossing, waiting, queueing, walking, and talking. The activity category is 
attained by taking the majority of actions happening in one frame while ignoring the poses. 
We adopt the standard training test split used in [O]. 

In the Collective Activity dataset experiment, we further concatenate the global features 
for a scene with AC descriptors by HOG features [O]. We simply averaged AC descriptors 
features for all people and use this feature to serve as additional global information, namely 
this feature does not truly participated in the message passing process. This additional global 
information assists in classification with the limited amount of training data available for this 
dataset”. 

We summarize the comparisons of activity classification accuracies of different methods 
in Tab. 1. The current best result using spatial information in graphical model is 79.1%, 
from Lan et al. [O], which adopted a latent max-margin method to learn graphical model 
with optimized structure. Our classification accuracies (the best is 80.6%) are competitive 
compared with the state-of-the-art methods. However, the benefits of the message passing 
are clear. Through each step of the message passing, the factor layer effectively captured 
dependencies between different variables and passing messages using factor neurons results 
in a gain in classification accuracy. Some visualization results are shown in Fig 4 



1 Step MP 

2 Steps MP 

Latent Constituent [0] 

75.1% 

Pure DL 

73.6% 

78.4% 

Contextual model [O] 

79.1% 

SVMh-DL Feature 

75.1% 

80.6% 

Our Best Result 

80.6% 


Table 1: Scene classification accuracy on the Collective Activity Dataset. 


^Scene classification accuracy solely using AlexNet is 48%. 
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4.2 Nursing Home Dataset 

This dataset consists 80 videos and is captured in a nursing home, including a variety of 
rooms such as dining rooms, corridors, etc. The 80 surveillance videos are recorded at 640 
by 480 pixels at 24 frames per second, and contain a diverse set of actions and frequent 
cluttered scenes. This dataset contains typical actions include walking, standing, sitting, 
bending, squatting, and falling. For this dataset, the goal is to detect falling people, thus we 
assign each frame one of two activity categories: fall and non-fall. A frame is assigned “fall" 
if any person falls and “non-fall" otherwise. Note that many frames are challenging, and the 
falling person may be occluded by others in the scene. We adopted a standard 2/3 and 1/3 
training test split. In order to remove redundancy, we sampled 1 out of every 10 frames for 
training and evaluation. Since this dataset has a large intra-class diversity within actions, we 
used the action primitive based detectors proposed in [O] for more robust detection results. 

Note that since this dataset has no pose attribute, we simply used one scene-action factor 
layer to perform the two step message passing. For the SVM classifier, only deep learning 
features are used. We summarize the comparisons of activity classification accuracies of 
different methods in Table 2. 


Ground Truth 

Pure DL 

SVMh-DL Fea. 

Detection 

Pure DL 

SVMh-DL Fea. 

1 Step MP 

82.5% 

82.3% 

1 Step MP 

74.4% 

76.5% 

2 Steps MP 

84.1% 

84.7% 

2 Steps MP 

75.6% 

77.3% 


Table 2: Classification accuracy on the nursing home dataset 


The scene classification accuracy on the Nursing Home dataset by using a baseline 
AlexNet model is 69%. The results on scene classification for each step also shows gains. 
Note that in this dataset, accuracy on the second message passing gains an increase of around 
1.5% for both pure deep learning or SVM prediction. We believe that this is due to the fact 
that the dataset only contains two scene labels, fall or non-fall, so scene variables are not as 
informative as scenes in the Collective Activity Dataset. 

5 Conclusion 

We have presented a deep learning model for group activity recognition which jointly cap¬ 
tures the group activity, the individual person actions, and the interactions between them. 
We propose a way to combine graphical models with a deep network by mimicking the 
message passing process to do inference. We successfully applied this model to real scene 
surveillance videos and showed its’ effectiveness in recognizing the activity of a group of 
people. 
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Figure 4: Results visualization for our model. Green tags are ground truth, yellow tags are 
predicted labels. From left to right is without message passing, first step message passing 
and second step message passing 
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